1 BACKGROUND

Emerging only at the end of 2019 and declared by WHO as a global pandemic in March of 2020, COVID-19 has wrought an unprecedented devastation on global population health. By 1 January 2021 83.2 million confirmed cases and 1.8 million deaths had been reported to WHO. And as of May 13 2021, there are 160.1 million confirmed COVID-19 cases and 3.3 million deaths that have been reported to WHO (an additional 76.9 million cases and 1.5 million deaths in 2021 so far).



However, the reported COVID-19 burden has given an incomplete picture of the true impact of covid. There are a number of reasons for this:

  • there have been significant variations in testing access across countries,
  • some countries have better diagnostic capacity than others,
  • and over the course of the pandemic, there have been differences in how COVID-19 deaths have been defined and certified.

As a result of this, in many cases, deaths caused by COVID-19 have been incorrectly attributed to other causes.


1.1 Excess mortality attributable to COVID-19

The excess mortality measure is crucial for accurately quantifying impact. The WHO defines excess mortality as “the mortality above what would be expected based on the non-crisis mortality rate in the population of interest”. Excess mortality is thus the mortality that is attributable to the crisis conditions.

To consider the impact of COVID-19, the mortality rate for year 2020, week \(t\) and age \(x\), can be represented by \[m^o_{x, t}\] and is assumed to be a result of the direct effects of COVID-19 (deaths attributable to it) and the indirect knock-on effects on health systems and society. These indirect effects impact the number of deaths due to other natural as well as external causes. The hypothetical or “counterfactual” no-COVID-19 scenario has expected mortality rate \[m^e_{x, t}\] which is estimated as the average of the most recent death rates (prior to the pandemic) or the forecast of historical rates to 2020. Excess mortality, represented by \(\mu_{x, t}\), can thus be defined as the difference: \[ \mu_{x, t} = m^o_{x, t} - m^e_{x, t} \] evaluated in rate-space to account for population changes and estimated as an absolute rate difference or percentage change. This gives a more comprehensive picture of how a location has been affected by the disease.



There is a data gap to overcome. Routine mortality data is usually received on an annual basis after the year of death or following an even longer lag. Civil registration and vital statistics (CRVS) systems differ across countries with varying timeline and quality control measures for compiling unit record cause-of-death numbers into aggregates indentified by cause, age, sex, place and period of death. In addition, differential reporting capacity and variable data quality across countries has resulted in many nations lacking the systems to provide good quality routine data in the past. Correspondingly, they also lack the capacity and data required to monitor all-cause mortality during this unprecedented pandemic. This results in a number of countries being unable to contribute to a centralized systematic mortality surveillance that would be needed to measure global, regional and country level excess mortality by the WHO.



To track excess mortality globally requires the development of tools and strategies for capturing all-cause mortality that account for the current state of irregular mortality tracking as well as imperfect country capacity. This will be achieved using three complementary exercises:

  • Identifying countries that have timely and complete all-cause numbers for which the counterfactual and the observed mortality, and thus the excess, can be determined directly. This includes identifying non-conventional data sources that may provide intermediate estimates of mortality within the CRVS pipeline;
  • Rapid upscale of the rate at which all-cause mortality is recorded through short term surveillance systems that are built to track all the mortality experienced during this pandemic in real time and to supplement routine systems but are not comprehensive for cause or other granular detail.
  • Predicting the expected excess mortality for places with insufficient data to obtain estimates empirically. This by using statistical models that leverage the estimates from the countries where mortality is known and using them to inform plausible estimates for the countries where it is not, considering country age-, sex-, cause- and income-profiles. We note that such a solution is sub-optimal and vulnerable to the bias implicit in making inference about an unknown quantity based on incomplete information. The optimal long-term goal would be to strengthen country surveillance and to build efficient and reliable death registration systems that are at maximal coverage for all countries. In the absence of these, however, this allows for the tracking of excess mortality of more countries that can be used to generate reasonable regional and global estimates.

Over the next few sections, more detail is provided on the preliminary statistical modeling exercises.

2 DATA INPUT

The code below lists the data sources used as input for the statistical analysis and also shows the wrangling of the data to produce the main input file. URL links for all data files are provided in the code to facilitate reproducibility; and a download link for the final analysis data set is found below the code chunk.

#####################################################################################################################################
# House keeping
rm(list = ls(all = TRUE)) 
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

pacman::p_load(data.table, lubridate, tidyr, dplyr, countrycode)

#####################################################################################################################################
# Additional files for analysis 
#####################################################################################################################################

# 1) Codes for mapping iso2 to iso3 for WHO public data
# 2) Total population numbers by country for 2020 (WPP2019 for 194 Member States)
# 3) GHE data by broad cause group, age, sex and year 2000 to 2019 (183/194 member states)
# 4) GBD data by broad cause group, age, sex and year 2000 to 2019 (11/194 member states)

ccodes      <- fread("https://www.dropbox.com/s/hoefrsvbk3lz389/ccodes.csv?dl=1") 
population  <- fread("https://www.dropbox.com/s/nfmbu67o03kw8o4/population.csv?dl=1")

gheg3       <- fread("https://www.dropbox.com/s/cigt1ahro12qers/group3causes_ghe_183.csv?dl=1") %>%
  mutate(sex       = factor(sex, labels = c("Male", "Female")),
         causename = case_when(causename == "All Causes" ~ "All",
                               causename == "Communicable, maternal, perinatal and nutritional conditions" ~ "Communicable",
                               causename == "Noncommunicable diseases" ~ "Noncommunicable",
                               TRUE ~ as.character(causename))) %>%
  select(iso3, year, age, sex, causename, dths) 

gbdg3       <- fread("https://www.dropbox.com/s/5hpa957mbzk9mrs/group3causes_gbd_11.csv?dl=1") %>%
  rename(agel = age, causename = cause, dths = val) %>%
  left_join(data.table(iso3 = c("AND","COK","DMA","MHL","MCO","NRU","NIU","PLW","KNA","SMR","TUV"),
             location = c("Andorra", "Cook Islands", "Dominica", "Marshall Islands","Monaco", 
                          "Nauru", "Niue", "Palau", "Saint Kitts and Nevis", "San Marino", "Tuvalu")), by = "location") %>% 
  left_join(data.table(age = seq(0,85,5), agel = c("Under 5", paste(seq(5,80,5),"to",seq(5,80,5)+4),"85 plus")), by="agel") %>%
  mutate(sex       = factor(sex, levels = c("Male", "Female")),
         causename = case_when(causename == "All causes" ~ "All",
                               causename == "Communicable, maternal, neonatal, and nutritional diseases" ~ "Communicable",
                               causename == "Non-communicable diseases" ~ "Noncommunicable",
                               TRUE ~ as.character(causename))) %>%
  select(iso3, year, age, sex, causename, dths)

# Combine the GHE and GBD based broad cause numbers

group3_194   <- rbind(gheg3, gbdg3) %>% 
  group_by(iso3, year, causename) %>% summarise(dths = sum(dths), .groups = "drop") %>% ungroup() %>%
  spread(causename, dths) %>% arrange(iso3, year) %>% right_join(population, by = c("iso3","year"))

# Use the 2019 rates for NCDs, Communicable diseases and injuries as potential covariates

covariate.ghe<- group3_194 %>% 
  filter(year == 2019)  %>% select(-c(year, All)) %>%
  gather(cause, val, -iso3, -pop) %>% mutate(val=1e5*val/pop) %>% spread(cause, val) %>% mutate(pop=NULL)

#####################################################################################################################################
# Time-series random walk order 2 for forecast of expected deaths in 2020 
#####################################################################################################################################

# Function to forecast the GHE all cause deaths for a single year

expected.ghe <- function(){
  set.seed(12345)
  iso.list   <- sort(unique(group3_194$iso3)) # vector of iso3 to loop through;  
  ex.list    <- list()                        # empty list to store expected values in;

for (i in 1:194){
  print(paste0(round(100*i/194,1), "%"))
  
  group3.subset <- group3_194 %>% filter(iso3 == iso.list[i]) %>% 
    arrange(year) %>% select(iso3, year, All) %>% mutate(lall = log(All))
  
    fit         <- inla(lall ~ f(year, model = "rw2"), data = group3.subset, control.predictor= list(compute=TRUE))
    fit.out     <- data.table(expected = exp(fit$summary.fitted.values$mean),
                              fit.l    = exp(fit$summary.fitted.values$"0.025quant"),
                              fit.u    = exp(fit$summary.fitted.values$"0.975quant"))
    
    ex.list[[i]]<- cbind(group3.subset %>% mutate(lall = NULL), fit.out)
}
rbindlist(ex.list)
}

# takes a few min to run the function
# expected.out <- expected.ghe()
# fwrite(expected.out, file = "Data/expected.out.csv", row.names = F)

# loading the saved output from run

expected.out <- fread("https://www.dropbox.com/s/yjr1s99s2cmdpdb/expected.out.csv?dl=1") 

ghe.expected <- expected.out %>%
  filter(year == 2020) %>% select(iso3, expected) %>% 
  mutate(expected = round(0.25*expected), observed = NA)

ghe.expected <- rbind(ghe.expected %>% mutate(quarter = "q1"),
                      ghe.expected %>% mutate(quarter = "q2"),
                      ghe.expected %>% mutate(quarter = "q3"),
                      ghe.expected %>% mutate(quarter = "q4")) %>% arrange(iso3, quarter)


#####################################################################################################################################
# WHO COVID case and death data by date and country
#####################################################################################################################################

# Read in the WHO data and clean up some of the names

who_df      <- fread("https://covid19.who.int/WHO-COVID-19-global-data.csv") %>% 
  mutate(Country_code = ifelse(Country == "Namibia", "NA", Country_code)) %>%
  left_join(ccodes, by = "Country_code") %>%
  mutate(Country = case_when(Country == "Curaçao" ~ "Curacao",
                             Country == "Saint Barthélemy" ~ "Saint Barthelemy",
                             Country == "Réunion" ~ "Reunion",
                             Country == "Kosovo[1]" ~ "Kosovo",
                             Country %in% c("Bonaire","Saba","Sint Eustatius") ~ "Bonaire, Sint Eustatius and Saba",
                             TRUE ~ as.character(Country)),
         iso3    = case_when(Country == "Kosovo" ~ "XKX",
                             Country == "Saint Martin"~ "MAF",
                             Country == "Sint Maarten"~ "SXM",
                             Country == "Bonaire, Sint Eustatius and Saba" ~ "BES",
                             Country == "Curacao" ~ "CUW",
                             Country == "Saint Barthelemy" ~ "BLM",
                             Country == "South Sudan" ~ "SSD",
                             Country == "Namibia" ~ "NAM",
                             TRUE ~ as.character(iso3)),
         date    = as.Date(Date_reported,"%y-%m-%d")) %>%
  rename(cov_cases  = Cumulative_cases, cov_deaths = Cumulative_deaths)  

# Subset to the 194 member states and aggregate cases and deaths by quarter of death

who_2020_194 <- who_df %>% 
  filter(date =="2020-04-01" | date =="2020-07-01"| date =="2020-10-01" | date =="2021-01-01") %>%
  mutate(period = case_when(date == "2020-04-01" ~ "p1",
                            date == "2020-07-01" ~ "p2",
                            date == "2020-10-01" ~ "p3",
                            date == "2021-01-01" ~ "p4")) %>% 
  select(Country, iso3, WHO_region, period, cov_cases, cov_deaths) %>%
  gather(source, val, -Country, -iso3, -WHO_region, -period) %>%
  group_by(Country, iso3, WHO_region, period, source) %>%
  summarise(val = sum(val, na.rm = T), .groups = "drop") %>% ungroup() %>% 
  spread(period, val) %>% 
  mutate(q1 = p1, q2 = p2 - p1, q3 = p3 - p2, q4 = p4 - p3) %>%
  select(-c(p1, p2, p3, p4)) %>%  
  gather(quarter, val, -Country, -iso3, -WHO_region, -source) %>%
  spread(source, val) %>%
  right_join(population %>% filter(year == 2020), by = "iso3") %>%
  mutate(WHO_region = factor(WHO_region, 
             levels = c("AFRO","AMRO","EMRO","EURO","SEARO","WPRO"))) %>%
  arrange(iso3, quarter) %>% 
  group_by(iso3) %>% 
  mutate(cc_all = 1e5*sum(cov_cases)/pop, cd_all = 1e5*sum(cov_deaths)/pop) %>% ungroup()


#####################################################################################################################################
# SDI data from GBD and Income group from World Bank
#####################################################################################################################################

# SDI, filter for country locations (leave out region or subnational; & Georgia in the USA)

sdi.loc <- fread("https://www.dropbox.com/s/vhajcoyzw08wb8l/sdi.country.csv?dl=1")
temp    <- tempfile(fileext = ".xlsx")
dataURL <- "http://ghdx.healthdata.org/sites/default/files/record-attached-files/IHME_GBD_2019_SDI_1990_2019_Y2020M10D15.XLSX"
download.file(dataURL, destfile=temp, mode='wb')

sdi.gbd <- readxl::read_excel(temp, skip = 1)[-c(105), ] %>% 
  filter(Location %in% sdi.loc$Location & Location != "Virgin Islands" ) %>%
  mutate(iso3 = countrycode(Location, "country.name", "iso3c")) %>%
  gather(year, sdi, -Location, -iso3) %>%
  mutate(sdi=as.numeric(gsub( "·", ".",sdi)), year = as.numeric(year), Location = NULL) %>%
  filter(year == 2019 & iso3 %in% unique(population$iso3)) %>%
  select(-c(year))

# Income group from World Bank, assume Cook Islands and Niue are upper middle income like Tonga and Tuvalu

temp     <- tempfile(fileext = ".xls")
dataURL  <- "http://databank.worldbank.org/data/download/site-content/CLASS.xls"
download.file(dataURL, destfile=temp, mode='wb')

wbincome <- readxl::read_excel(temp, skip = 4) %>%
  filter(Code != "x") %>% 
  select(Code, `Income group`) %>%
  rename(iso3 = Code, group = "Income group") %>%
  filter(iso3 %in% unique(population$iso3))

wb_oth   <- wbincome  %>% 
  filter(iso3 %in% c("TON","TUV")) %>% mutate(iso3 = c("COK", "NIU")) 

wbincome <- wbincome %>% rbind(wb_oth)
  

#####################################################################################################################################
# Observed compiled data from Karlinsky et al and multiple sources
#####################################################################################################################################

# The baseline are forcasted by time using the linear regression function
# Will compare the aggregate for Ariels forecasts to the aggregate GHE forecasts

baseline_data <- fread("https://raw.github.com/dkobak/excess-mortality/main/baselines.csv") %>%
  rename(country = V1, time = V2, expected = V3) %>%
  filter(!(country %in% c("Kosovo", "Transnistria"))) %>%
  mutate(iso3 = countrycode(country, "country.name", "iso3c"), country = NULL)

# Pull the collated observed data from World Mortality Data
# This step is replicated using HMD, EUROMOMO etc when doing age-disaggregated analysis

observed.df   <- fread("https://raw.github.com/akarlinsky/world_mortality/main/world_mortality.csv", header = T) %>%
  filter(!(country_name %in% c("Kosovo", "Transnistria", 
                               "Uruguay", "Belarus", "Nicaragua","El Salvador")) & year == 2020 & 
           time_unit != "quarterly (Solar Hijri)") %>%
  mutate(iso3 = countrycode(country_name, "country.name", "iso3c")) %>%
  left_join(baseline_data, by = c("iso3","time")) %>% 
  mutate(quarter = ifelse((time_unit == "weekly" & time >= 1 & time <= 13)|
                          (time_unit == "monthly" & time >= 1 & time <= 3)|
                           (time_unit == "quarterly" & time == 1), "q1",
                          ifelse((time_unit == "weekly" & time >= 14 & time <= 26)|
                                   (time_unit == "monthly" & time >= 4 & time <= 6)|
                                   (time_unit == "quarterly" & time == 2), "q2",
                                 ifelse((time_unit == "weekly" & time >= 27 & time <= 39)|
                                          (time_unit == "monthly" & time >= 7 & time <= 9)|
                                          (time_unit == "quarterly" & time == 3), "q3", "q4")))) %>%
  group_by(iso3, quarter) %>% 
  summarise(observed = sum(deaths, na.rm = T), expected = sum(expected, na.rm = T), .groups = "drop") %>% ungroup()

expected.df  <- ghe.expected %>% filter(!(iso3 %in% observed.df$iso3)) %>% rbind(observed.df)

#####################################################################################################################################
# Oxford tracker data on government policy (https://www.nature.com/articles/s41562-021-01079-8)
# Use medians for the quarter to get summary statistic
# Infill missing countries with WHO region means

covid_pol    <- fread("https://github.com/OxCGRT/covid-policy-tracker/raw/master/data/OxCGRT_latest.csv")

covid_pol_df <- covid_pol %>% 
  rename(iso3 = CountryCode) %>%
  mutate(quarter = ifelse(Date < 20200401, "q1", ifelse(Date < 20200701, "q2", ifelse(Date < 20201001, "q3", "q4")))) %>%
  group_by(iso3, quarter) %>%
  summarize(Stringency = median(StringencyIndex, na.rm = T), 
            Government = median(GovernmentResponseIndex, na.rm = T), 
            Containment = median(ContainmentHealthIndex, na.rm = T), 
            Economic = mean(EconomicSupportIndex, na.rm = T), .groups = "drop") %>%
  ungroup() %>% filter(!is.na(Stringency)) %>%
  right_join(who_2020_194 %>% select(iso3, WHO_region, quarter), by = c("iso3", "quarter")) %>%
  group_by(WHO_region, quarter) %>% 
  mutate(Stringency = ifelse(is.na(Stringency), mean(Stringency, na.rm = T), Stringency),
         Government = ifelse(is.na(Government), mean(Government, na.rm = T), Government),
         Containment = ifelse(is.na(Containment), mean(Containment, na.rm = T), Containment),
         Economic = ifelse(is.na(Economic), mean(Economic, na.rm = T), Economic)) %>%
  ungroup() %>% mutate(WHO_region = NULL)


#####################################################################################################################################
# Dataset to use in preliminary statistical models
#####################################################################################################################################

# Now bring all the data together
# Create some rate variables from counts as well as interactions

analysis_mod <- who_2020_194 %>%
  left_join(expected.df, by = c("iso3","quarter")) %>%
  left_join(covid_pol_df, by = c("iso3","quarter")) %>%
  left_join(covariate.ghe, by = "iso3") %>%
  left_join(sdi.gbd, by = "iso3") %>%
  left_join(wbincome, by = "iso3") %>%
  arrange(iso3, quarter) %>%
  rename(cmr = Communicable, injr = Injuries, ncdr = Noncommunicable) %>%
  mutate(group      = factor(ifelse(group == "High income", "HIC",
                             ifelse(group == "Upper middle income", "UMIC", "LMIC")),
                                   levels = c("LMIC", "UMIC", "HIC")),
         cfr        = ifelse(cov_cases == 0, 0, cov_deaths/cov_cases), # deaths per case
         expcov     = expected + cov_deaths,                           # sum of expected and covid
         excess     = round(observed - expected),                      # excess deaths
         ratio      = observed/expcov,                                 # ratio of observed to sum of expected and covid
         Nx         = pop*1e-5,                                        # per 100,000 population
         expcovr    = expcov/Nx,                                       # expected plus covid rate
         excessr    = excess/Nx,                                       # observed - covid rate
         expectedr  = expected/Nx,                                     # expected rate
         covidr     = cov_deaths/Nx,                                   # covid rate
         casesr     = cov_cases/Nx,                                    # cases rate
         covsdi     = covidr * sdi,                                    # interaction covid and SDI
         ncdsdi     = ncdr * sdi,                                      # interaction NCDs and SDI
         quarter    = as.factor(quarter),                              # factor of quarter
         lratio     = log(ratio),                                      # natural log of ratio
         sdrat      = sd(lratio, na.rm = T),                           # standard deviation of log ratio
         mrat       = mean(lratio, na.rm = T),                         # mean of log ratio
         zratio     = (lratio - mrat)/sdrat,                           # standardized log ratio
         class      = factor(cut(right = FALSE, x = covidr,            # direct COVID impact burden broad gategory
                          breaks = c(0, 1, 10, 20, 40, 70, 1e5)),
                          labels = paste(c(0, 1, 10, 20, 40, 70))),
         index      = ifelse(is.na(excess), 0, 1),                     # index on observed vs missing
         row = 1:nrow(who_2020_194)) %>%                               # row index 
  select(Country, iso3, WHO_region, group, index, pop, Nx, quarter,
         observed, expected, ratio, excess, excessr, expectedr, class,
         cov_deaths, covidr, cov_cases, casesr, expcov, expcovr, cc_all, cd_all,
         sdi, covsdi, ncdr, ncdsdi, cmr, cfr, sdrat, mrat, zratio, 
         Stringency, Government, Containment, Economic, row)  

save(analysis_mod, file = "Data/analysis_mod.Rda")

The analysis data used for the models can be downloaded as an R-database using the analysis_mod.RDa link below:

3 FORECASTED BASELINE

One of the key inputs in the estimation of excess mortality is the baseline or counterfactual i.e., the expected deaths in the absence of COVID-19. As shown the in data input, one way that these can be determined for the aggregate of the year is through a one-year step forecast. THis is performed using the GHE data for the historical trend. The forecast assumes a second order random walk (RW2) model which has the density \[\pi(y) \propto exp\left(-\frac{1}{2}\sum_{i=3}^n (y_i - 2y_{i-1} + y_{i-2})^2\right)\] where the term \(y_{i+1} −2y_i +y_{i−1}\) can be interpreted as an estimate of the second order derivative of a continuous time function \(y(t)\) at \(t = i\) using values of \(y(t)\) at \(t = i − 1\), \(i\), and \(i+1\). The plots below show the historical mean values from the GHE as well as the crresponding forecasted estimates and uncertainty. (Note only the mean is used as input). In addition, where available, the expected values for 2020 according to World Mortality Data are also shown.

4 INITIAL MODELS

4.1 LINEAR MODEL

The first option to predicting excess mortality is to build a log-linear model that can be applied according to a Generalized linear models (GLMs) framework. A GLM has the structure

\[g\big(\mu_i\big) = \boldsymbol{x}_i \boldsymbol{\beta}\]

where \(\mu_i\) is the expectation of response variable \(Y_i\) such that \(\mu_i \equiv \mathbb{E}\big(Y_i\big)\), \(g\) is a link function, \(\boldsymbol{x}_i\) is the \(i\)th row of the covariate matrix \(\boldsymbol{X}\), and \(\boldsymbol{\beta}\) is a vector of unknown coefficients. Akin to linear models, GLMs center around a linear predictor, \(\boldsymbol{X}\boldsymbol{\beta}\), and allow for link functions other than the identity function. GLMs also allow for any distribution from the exponential family to model the dependent variable instead of just assuming it Gaussian. We predict the log of the ratio

\[\mu_i = log(s) = log\left(\frac{Observed}{Expected + covid19}\right) = \boldsymbol{x}_i \boldsymbol{\beta}\] for \(x_i \in \{SDI, NCD, COVID\}\) where \(COVID\) represents the reported COVID mortality rate, \(NCD\) are the NCD deaths rates as described in data input as well as the interaction of the SDI and COVID. The code used to predict the scaling factor as well as the estimation of the uncertainty is provided below. Results are in a tab to follow.

#####################################################################################################################################
# Linear regression model using GLM instead of LM incase we change assumption about family
#####################################################################################################################################
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

#####################################################################################################################################
# Load packages for all statistical analyses
pacman::p_load(highcharter, data.table, tidyr, dplyr, INLA, xgboost, mlr, countrycode)

# Can load the main data file again
load("Data/analysis_mod.Rda")

#####################################################################################################################################
# Aggregating from quarters to the sum for the entire year

df.lin      <- analysis_mod %>% 
  dplyr::select(Country, iso3, WHO_region, cov_deaths, expected, observed, Nx, ncdr, sdi) %>% 
  group_by(Country, iso3, WHO_region, Nx, ncdr, sdi) %>%
  summarize(covid = sum(cov_deaths), expected = sum(expected), 
            observed = sum(observed), .groups = "drop") %>%
  ungroup() %>% 
  mutate(excess = observed - expected, expectedc = expected + covid, 
         ratio = observed/expectedc, covidr = covid/Nx, covsdi = covidr * sdi) %>%
  arrange(iso3)

#####################################################################################################################################
# Indices to identify rows

df.lin      <- df.lin %>% mutate(row = 1:nrow(df.lin))
ind.obsl    <- df.lin %>% filter(!is.na(excess)) %>% pull(row)

#####################################################################################################################################
# Approach
# 1) Run basic linear regression model
# 2) Sample 200 sets of coefficients using distribution given by inverse of variance matrix
# 3) Generate 200 realizations of fitted scaling factors for each country
# 4) Replace scaling factor with observed for places with data
# 5) Apply scaling factors to observed covid + expected to get predicted actual
# 6) Excess is predicted actual - expected
# 7) Use samples to summarise information by country but also for regional summaries

reg.mod     <- glm(log(ratio) ~ . - 1, data = df.lin %>% dplyr::select(ratio, sdi, covidr, covsdi, ncdr))
pe          <- reg.mod$coefficients  # point estimates
vc          <- vcov(reg.mod)         # variance-covariance

set.seed(1234)
draws       <- 200

# Distribution of coefficients
simbetas    <- MASS::mvrnorm(draws, pe, vc)

# Input data to scale the coefficients
df.lin.res  <- df.lin %>% dplyr::select(dimnames(simbetas)[[2]]) %>% as.matrix()

# Distribution of fitted values (these are scaling factors from the ratio of observed to covid plus expected)
replic      <- exp(df.lin.res %*% t(simbetas))

# Use the scaling factors to obtain distribution of excess deaths
fitdist1e   <- replic*replicate(draws, df.lin$expectedc) - replicate(draws, df.lin$expected)
obsvals     <- df.lin$excess # vector of observed values

# For the non-missing, replace predictions with the actual observed
for (i in 1:draws){
  fitdist1e[ind.obsl,] <- obsvals[ind.obsl]
}

The syntax used to generate country, regional and global uncertainty bounds can be shown through the link below.

#####################################################################################################################################
# Basic functions to summarise 80% bootstrap intervals or sample from gaussian
#####################################################################################################################################

getsum <- function(x){
  tibble(mean = mean(x, na.rm = T), 
         lwr  = quantile(x, 0.1, na.rm = T), 
         uppr = quantile(x, 0.9, na.rm = T)) %>% round()
}

getsum2 <- function(x){
  c(mean(x, na.rm = T), quantile(x, 0.1, na.rm = T), quantile(x, 0.9, na.rm = T))
}

get.sample <- function(x){
  rnorm(draws, x[1], x[2])
}
#####################################################################################################################################
# Table of summarized values joined to observed and explanator data
df.lin.out   <- data.table::data.table(t(apply(fitdist1e, 1, getsum2))) %>%
  rename(y = "V1", low = "10%", high = "90%") %>%
  cbind(df.lin)  %>% 
  mutate(y    = ifelse(!is.na(excess), excess, y),
         low  = ifelse(!is.na(excess), NA, low),
         high = ifelse(!is.na(excess), NA, high), 
         source = ifelse(is.na(excess), "Predicted", "Observed")) %>% arrange(-y) %>%
  data.frame()

#####################################################################################################################################
# Indices by region

afro.indl   <- df.lin %>% filter(WHO_region == "AFRO") %>% pull(row)
emro.indl   <- df.lin %>% filter(WHO_region == "EMRO") %>% pull(row)
euro.indl   <- df.lin %>% filter(WHO_region == "EURO") %>% pull(row)
paho.indl   <- df.lin %>% filter(WHO_region == "AMRO") %>% pull(row)
searo.indl  <- df.lin %>% filter(WHO_region == "SEARO") %>% pull(row)
wpro.indl   <- df.lin %>% filter(WHO_region == "WPRO") %>% pull(row)
glob.indl   <- df.lin %>% pull(row)
#####################################################################################################################################

afro.sum   <- getsum(apply(fitdist1e[afro.indl,], 2, sum)) %>% 
  mutate(region = "AFRO", covid = sum(df.lin[afro.indl,]$covid), Nx = sum(df.lin[afro.indl,]$Nx)) 
emro.sum   <- getsum(apply(fitdist1e[emro.indl,], 2, sum)) %>% 
  mutate(region = "EMRO", covid = sum(df.lin[emro.indl,]$covid), Nx = sum(df.lin[emro.indl,]$Nx)) 
euro.sum   <- getsum(apply(fitdist1e[euro.indl,], 2, sum)) %>% 
  mutate(region = "EURO", covid = sum(df.lin[euro.indl,]$covid), Nx = sum(df.lin[euro.indl,]$Nx)) 
paho.sum   <- getsum(apply(fitdist1e[paho.indl,], 2, sum)) %>% 
  mutate(region = "PAHO", covid = sum(df.lin[paho.indl,]$covid), Nx = sum(df.lin[paho.indl,]$Nx)) 
searo.sum  <- getsum(apply(fitdist1e[searo.indl,], 2, sum)) %>% 
  mutate(region = "SEARO", covid = sum(df.lin[searo.indl,]$covid), Nx = sum(df.lin[searo.indl,]$Nx)) 
wpro.sum  <- getsum(apply(fitdist1e[wpro.indl,], 2, sum)) %>% 
  mutate(region = "WPRO", covid = sum(df.lin[wpro.indl,]$covid), Nx = sum(df.lin[wpro.indl,]$Nx))
global.sum <- getsum(apply(fitdist1e[glob.indl,], 2, sum)) %>% 
  mutate(region = "Global", covid = sum(df.lin[glob.indl,]$covid), Nx = sum(df.lin[glob.indl,]$Nx))
#####################################################################################################################################

sum.exc.df <- rbind(afro.sum, emro.sum, euro.sum, 
                    paho.sum, searo.sum, wpro.sum, global.sum) %>% 
  rename(low = lwr, high = uppr) %>%
  mutate(excess = round(mean - covid), 
         mean = round(mean), low = round(low), high = round(high),
         region = factor(region, levels = c("AFRO", "EMRO", "EURO", "PAHO", "SEARO", "WPRO", "Global")), 
         model = "Basic Linear") %>%
  dplyr::select(model, region, Nx, covid, excess, low, mean, high)

4.2 MIXED EFFECTS MODEL

A second option is to assume a linear mixed effects model to predict the excess mortality rate. In a linear mixed effects model we let \(y_j = (y_{j1}; \ldots ; y_{jn})\) denote the vector of repeated measurements for a group \(j\). We reconsider the general model \[ y_j = X_j\beta + Zjbj + \varepsilon_j\] in which \(\beta\) is a vector of population-average regression coefficients called fixed effects, and where \(b_j\) is a vector of group-specific regression coefficients. The \(b_j\) describe how the evolution of the \(j\)th group deviates from the average. The matrices \(X_j\) and \(Z_j\) are matrices of known covariates. The residual components \(\varepsilon_j\) are assumed to be independent \(N(0, \Sigma_j)\), where \(\Sigma_j\) depends on \(j\).

In this case the variable of interest is the excess mortality rate i.e., the difference between the observed and expected relative to the overall population per 100,000. As shown in the data input tab, the data are disaggregated by quarter of deaths. For fixed effects we can take the overall covid death rates for the year, the expected death rates, the sdi and quarter specific covid, NCD and communicable disease death rates as well as the quarter. Random effects assumed for the region and income group. The model, is shown in the code below with results to follow in the final tab.

#####################################################################################################################################
# Mixed effects model in INLA, similar syntax to GLM
#####################################################################################################################################

df.mixed  <- inla(excessr ~ cd_all + expectedr + sdi * covidr + ncdr + cmr + quarter - 1 + 
                  f(WHO_region, model = "iid") + f(group, model = "iid"), 
                  data = analysis_mod, 
                  control.predictor= list(compute=TRUE))

pred.in   <- cbind(df.mixed$summary.fitted.values$mean, df.mixed$summary.fitted.values$sd) 
fit.dist3 <- t(apply(pred.in, 1, get.sample)) # Assume distribution of fit is Normal with mean and sd

ind.obs <- analysis_mod %>% filter(!is.na(excess)) %>% pull(row) # indices for no data

#####################################################################################################################################
# For the observed replace predictions with actual excess rate estimates

for (i in 1:draws){
  fit.dist3[ind.obs,] <- analysis_mod$excessr[ind.obs]
}

fit.dist3e   <- analysis_mod$Nx     * fit.dist3 

#####################################################################################################################################
# Results are by quarter, need to derive draws by country and then use those to derive draws by region etc
fit.dist3ef1 <- array(dim = c(194, draws))
for (j in 1:194){
  fit.dist3ef1[j,] <- apply(fit.dist3e[(j*4 - 3):(j*4),], 2, sum)
}

#####################################################################################################################################
# Summarize with bootstrap confidence intervals based on the distribution of the draws
df.mixed.out   <- data.table::data.table(t(apply(fit.dist3ef1, 1, getsum2))) %>%
  rename(y = "V1", low = "10%", high = "90%") %>%
  cbind(df.lin) %>% filter(!is.na(y))  %>% 
  mutate(y    = ifelse(!is.na(excess), excess, y),
         low  = ifelse(!is.na(excess), NA, low),
         high = ifelse(!is.na(excess), NA, high), 
         source = ifelse(is.na(excess), "Predicted", "Observed")) %>% arrange(-y)

The code for aggregating outputs to generate region specific uncertainty can be shown using the link below.

#####################################################################################################################################
# Identify the rows for each region to aggregate the matrices accordingly

afro.ind    <- analysis_mod %>% filter(WHO_region == "AFRO") %>% pull(row)
emro.ind    <- analysis_mod %>% filter(WHO_region == "EMRO") %>% pull(row)
euro.ind    <- analysis_mod %>% filter(WHO_region == "EURO") %>% pull(row)
paho.ind    <- analysis_mod %>% filter(WHO_region == "AMRO") %>% pull(row)
searo.ind   <- analysis_mod %>% filter(WHO_region == "SEARO") %>% pull(row)
wpro.ind    <- analysis_mod %>% filter(WHO_region == "WPRO") %>% pull(row)
glob.ind    <- analysis_mod %>% pull(row)

#####################################################################################################################################
# Summarise within each aggregate, first by column (draw) and then across draws by aggregate

afro.sum.exc   <- getsum(apply(fit.dist3e[afro.ind,], 2, sum)) %>% 
  mutate(region = "AFRO", covid = sum(analysis_mod[afro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[afro.ind,]$Nx)) 
emro.sum.exc   <- getsum(apply(fit.dist3e[emro.ind,], 2, sum)) %>% 
  mutate(region = "EMRO", covid = sum(analysis_mod[emro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[emro.ind,]$Nx)) 
euro.sum.exc   <- getsum(apply(fit.dist3e[euro.ind,], 2, sum)) %>% 
  mutate(region = "EURO", covid = sum(analysis_mod[euro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[euro.ind,]$Nx)) 
paho.sum.exc   <- getsum(apply(fit.dist3e[paho.ind,], 2, sum)) %>% 
  mutate(region = "PAHO", covid = sum(analysis_mod[paho.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[paho.ind,]$Nx)) 
searo.sum.exc  <- getsum(apply(fit.dist3e[searo.ind,], 2, sum)) %>% 
  mutate(region = "SEARO", covid = sum(analysis_mod[searo.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[searo.ind,]$Nx)) 
wpro.sum.exc  <- getsum(apply(fit.dist3e[wpro.ind,], 2, sum)) %>% 
  mutate(region = "WPRO", covid = sum(analysis_mod[wpro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[wpro.ind,]$Nx))
global.sum.exc <- getsum(apply(fit.dist3e[glob.ind,], 2, sum)) %>% 
  mutate(region = "Global", covid = sum(analysis_mod[glob.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[glob.ind,]$Nx))

#####################################################################################################################################
sum.exc.exc <- rbind(afro.sum.exc, emro.sum.exc, euro.sum.exc, 
                     paho.sum.exc, searo.sum.exc, wpro.sum.exc, global.sum.exc) %>% 
  rename(low = lwr, high = uppr) %>%
  mutate(excess = round(mean - covid), 
         mean = round(mean), low = round(low), high = round(high),
         region = factor(region, levels = c("AFRO", "EMRO", "EURO", "PAHO", "SEARO", "WPRO", "Global")), 
         model = "Mixed Effects") %>%
  dplyr::select(model, region, Nx, covid, excess, low, mean, high)

5 NEGATIVE BINOMIAL MODELS

Serge Aleshin-Guendel, Jon Wakefield and William Msemburi

May 13, 2021

5.1 Context

For a country \(c\) with total mortality \(Y_c\) for 2020 we will report \[Y_c-E_c,\] where \(E_c\) is the expected deaths, based on a time series model applied to previous year’s data. For a country \(c\) without totaly mortality we instead use \[\widehat{Y}_c - E_c,\] where \(\widehat{Y}_c\) is the fitted value from a model based on covariates \(X_c\). In the next section we discuss models.

5.2 Models under consideration

Suppose we have the recorded number of deaths \(Y_c\) in \(2020\) and covariates \(X_c\) for each of \(C\) countries in the AMRO and EURO regions. We will explore \(3\) negative-binomial models for \(Y_c\), which take the form \[Y_c\sim \textsf{Negative-Binomial}(\mu_c, \phi),\] where \(E[Y_c]=\mu_c\) and \(\phi\) is an overdisperison parameter such that \(Var(Y_C)=\mu_c+\mu_c^2/\phi\):

  • Model 1: \(\mu_c=N_c \exp(X_c\beta)\), where \(N_c\) is the population size of country \(c\),
  • Model 2: \(\mu_c=E_c \exp(X_c\beta)\), where \(E_c\) is the expected number of deaths in \(2020\) for country \(c\) based on previous years’ deaths,
  • Model 3: \(\mu_c=E_c + N_c \exp(X_c\beta)\).

We will then use the fitted models to predict the number of excess deaths for countries without recorded number of deaths in \(2020\).

5.3 Covariate Exploration

Before fitting the models, we explore the relationships between the covariates and the responses. First we plot \(\log(Y/N)\) vs. each of the covariates, and overlay a loess smoother:

Next we plot \(\log(Y/E)\) vs. each of the covariates, and overlay a loess smoother:

Finally we plot \(\log((Y-E)/N)\) vs. each of the covariates, and overlay a loess smoother:

5.4 Fitting the models and residual analyses

We fit each of the \(3\) candidate models in Stan using \(N(0, 5^2)\) priors on regression coefficients and \(N^+(0, 100^2)\) (i.e. a half-normal) priors on the overdispersion parameter (this prior should be explored more). Covariates were centered and scaled to have mean \(0\) and standard deviation \(1\).

We will first examine these fits through residual analyses. In particular for each model, for each of the \(C\) countries we can look at the posterior mean for \(\mu_c\) vs. the observed \(Y_c\), the posterior mean for \(Y_c-\mu_c\) vs. the posterior mean for \(\mu_c\), and the posterior mean for \((Y_c-\mu_c)/\sqrt{\mu_c+\mu_c^2/\phi}\) vs. the posterior mean for \(\mu_c\).

5.5 Leave one out metrics

We can additionally perform a leave one out cross validation residual analysis, where we fit each of the models \(C\) times, each time leaving out one of the countries, and then look at the residuals for the left out countries.

In addition to looking at residuals, using leave one out cross validation we can compute the expected log-predictive density (elpd), also known as the log CPO: \[ \sum_{c=1}^C\log\left[\int p(Y_c\mid \theta)p(\theta\mid Y_{-c})d\theta\right] \] where \(\theta\) are all model parameters, \(p(Y_c\mid \theta)\) is the predictive density for country \(c\), and \(p(\theta\mid Y_{-c})\) is the posterior conditional on all of the countries’ data except for country \(c\).

## [1] "Model 1: Offset N"
##      Estimate       SE
## elpd -685.375 15.40145
## ic   1370.750 30.80290
## [1] "Model 2: Offset E"
##       Estimate       SE
## elpd -655.3858 16.45395
## ic   1310.7716 32.90789
## [1] "Model 3: Additive"
##       Estimate       SE
## elpd -651.2184 16.01878
## ic   1302.4369 32.03756

Higher elpd is better (or equivalently lower ic, which is just -2*elpd), and so we see that models 2 and 3 have the best fits according to elpd.

5.6 Predictions

For a given country \(i\) for which we do not have the number of recorded deaths \(Y_i\) in \(2020\), we can predict \(Y_i\) using the fitted \(\mu_i\) under each of the models. For each of the countries in the AMRO and EURO regions which did not have the number of recorded deaths in \(2020\), we now plot the posterior median and \(95\%\) credible intervals for the number of excess deaths, \(\mu_i-E_i\). For the \(C\) countries for which we had the number of recorded deaths in \(2020\), we plot the observed \(Y_c-E_c\)

Here’s the same plot but slightly zoomed in:

We can also see how these excess death estimates compare to the number of reported COVID19 deaths in 2020, for the countries for which we’re making predictions.

Further we can get region level predictions of the number of excess deaths:

## [1] "Model 1: Offset N"
##   region     est   lower   upper
## 1   AMRO 1196879 1142166 1257225
## 2   EURO 1099435 1042104 1162357
## [1] "Model 1: Offset E"
##   region     est   lower   upper
## 1   AMRO 1418871 1373582 1467314
## 2   EURO 1138242 1096769 1184846
## [1] "Model 1: Additive"
##   region     est   lower   upper
## 1   AMRO 1411024 1364107 1471211
## 2   EURO 1143649 1117139 1209144

6 RESULTS


6.1 Regional comparisons across models


6.2 Regional comparisons for each model


6.2.1 LINEAR MODEL

6.2.1.1 Option 1 country excess


6.2.1.2 Option 1 regional and global

Overall for the year 2020:

  • Global: 1.81 million COVID-19 and 3.31 million estimated excess deaths (UI 2.6-3.98)

    • AFRO: 43 thousand COVID-19 and 106 thousand estimated excess deaths (UI -12-217)

    • EMRO: 120 thousand COVID-19 and 287 thousand estimated excess deaths (UI 227-346)

    • EURO: 586 thousand COVID-19 and 1.13 million estimated excess deaths (UI 1.12-1.14)

    • PAHO: 860 thousand COVID-19 and 1.32 million estimated excess deaths (UI 1.3-1.34)

    • SEARO: 184 thousand COVID-19 and 446 thousand estimated excess deaths (UI 194-687)

    • WPRO: 20 thousand COVID-19 and 24 thousand estimated excess deaths (UI -258-282)


6.2.1.3 Barplots of option 1 regional excess deaths


6.2.2 MIXED-EFFECTS MODEL

6.2.2.1 Option 2 country excess

6.2.2.2 Option 2 regional and global

Overall for the year 2020:

  • Global: 1.81 million COVID-19 and 3.31 million estimated excess deaths (UI 3.01-3.62)

    • AFRO: 43 thousand COVID-19 and 193 thousand estimated excess deaths (UI 123-262)

    • EMRO: 120 thousand COVID-19 and 385 thousand estimated excess deaths (UI 302-469)

    • EURO: 586 thousand COVID-19 and 1.11 million estimated excess deaths (UI 1.1-1.12)

    • PAHO: 860 thousand COVID-19 and 1.33 million estimated excess deaths (UI 1.32-1.34)

    • SEARO: 184 thousand COVID-19 and 501 thousand estimated excess deaths (UI 281-730)

    • WPRO: 20 thousand COVID-19 and -205 thousand estimated excess deaths (UI -387–19)


6.2.2.3 Barplots of option 2 regional excess deaths


6.2.3 Comparison by model (regional and global)